ABSTRACT


Dynamic monitoring of software and system health of a com- 
plex cyber-physical system requires observers that continu- 
ously monitor variables of the embedded software in order to 
detect anomalies and reason about their root causes. There 
exists a variety of techniques for code instrumentation, but 
instrumentation might change runtime behavior and could re- 
quire costly software re-certification. 


In this paper, we present R2U2/E, a novel realization of 
our real-time, Realizable, Responsive, and Unobtrusive Unit 
(R2U2). The R2U2/E observers are executed in parallel on 
a dedicated 16 or 64 core EPIPHANY co-processor, thereby 
avoiding additional computational overhead to the system un- 
der observation. A DMA-based shared memory access archi- 
tecture allows R2U2/E to operate without any code instru- 
mentation or program interference. 


1. INTRODUCTION 


Modern cyber-physical systems, like unmanned aircraft 
(UAS), autonomous vehicles, or space systems are equipped 
with numerous sensors that make it possible for the system 
to perceive its environment and enable accurate guidance, 
navigation, and control. Measurements of these sensors 
need to be processed in real-time by a software system of 
considerable size. Functions for advanced autonomous op- 
erations, decision making, and planning substantially add to 
the complexity of the software. This software, which needs
to be executed on board the vehicle, is highly
safety-critical: failures can cause not only loss of the vehicle
and an unsuccessful end of the mission, but might also harm
human life. Therefore, such software must undergo rigorous
certification.


Nevertheless, such cyber-physical systems can fail for a
multitude of reasons. Faulty sensors, unreliable communication
between the sensor subsystem and the flight computer, or errors in
the design or implementation of the control system or control
logic can cause problems that could lead to total loss of the
mission or destruction of the vehicle.


If, however, off-nominal situations can be detected reliably, 
on-board and in real-time, then mitigation actions can be trig- 
gered, or dangerous actions avoided in the first place. 


For example, the European Mars lander Schiaparelli encountered
several errors during descent and landing, which caused
the probe to shut down its retro rockets at an altitude of several
kilometers, causing it to crash into the planet with a velocity
of about 150 m/s. A detailed incident report (Tolker-Nielsen,
2017) found that saturated values of the inertial
measurement units were handled wrongly by the guidance
and navigation software, which suddenly estimated a negative
altitude. Based upon those wrong estimates, the weak decision
logic assumed that landing had taken place, turned off
the rockets, and activated post-landing procedures. However,
the probe was still about 3.7 km above the ground. Redundant
instruments, like a Radar Doppler Altimeter (RDA), had produced
correct measurements, but they had been ignored. The
incident report therefore recommends, among other things
(recommendation 05, (Tolker-Nielsen, 2017)):


Robust and reliable sanity checks shall be implemented in the on- 
board S/W to increase the robustness of the design, which could be, 
but not limited to: 


• Check on attitude

• Check on altitude sign (altitude cannot be negative).

• Check on vertical acceleration during terminal descent and
landing (cannot be higher than gravity).

• Check altitude magnitude change (it cannot change from 3.7
Km to a negative value in one second).

• Check wrt pre-flight timeline (altitude or acceleration profile vs
time) to check consistency of measurements


Our system R2U2 (real-time, Realizable, Responsive, and
Unobtrusive Unit) has been designed to carry out checks
of safety and performance properties and of sensor and software
consistency, as well as to perform diagnostic reasoning
and prognostics in real time while the system is in operation.
R2U2 combines signal processing, Metric Temporal
Logic, Bayesian Networks, and model-based prognostics to
enable system designers to develop powerful and expressive
models. Checks like those mentioned above could
easily be modeled within R2U2. A check against the pre-flight
timeline could look like


$\text{RDA-switch-on} \rightarrow \Diamond_{[25,35]}\,\text{Backshell-separation}$


After the RDA has been switched on, the backshell needs to 
separate within the next 25 to 35 seconds (cf. (Tolker-Nielsen, 
2017), Fig. 2). R2U2 can perform numerous checks and per- 
form diagnostic reasoning for root cause analysis in case of 
an anomaly. 
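

The timeline check described above (backshell separation within 25 to 35 seconds after RDA switch-on) can be illustrated with a small offline evaluation over a recorded trace. This is only a sketch under stated assumptions: it is not the online, queue-based algorithm used by R2U2; the function names, the 1 Hz sampling, and the example trace are hypothetical.

```python
from typing import List, Sequence


def eventually_within(trace: Sequence[bool], t: int, lo: int, hi: int) -> bool:
    """True if `trace` holds at some step in [t+lo, t+hi].

    Steps beyond the end of the recorded trace are treated as false.
    """
    last = min(t + hi, len(trace) - 1)
    return any(trace[k] for k in range(t + lo, last + 1))


def check_timeline(trigger: Sequence[bool], response: Sequence[bool],
                   lo: int, hi: int) -> List[int]:
    """Return the steps where `trigger` holds but no `response` follows in [lo, hi]."""
    return [t for t, fired in enumerate(trigger)
            if fired and not eventually_within(response, t, lo, hi)]


# Hypothetical trace sampled at 1 Hz: RDA switched on at t=10,
# backshell separation observed at t=40 (30 s later, inside [25, 35]).
rda_on = [t == 10 for t in range(60)]
backshell_sep = [t == 40 for t in range(60)]
violations = check_timeline(rda_on, backshell_sep, 25, 35)
```

With the separation moved to t=50 (40 s after switch-on), the same call would report step 10 as a violation.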


However, this checking comes at a price: usually,
run-time monitoring is done using software observers that are
brought into the code by instrumentation of the flight software.
Although a large number of approaches exist, they share
two severe drawbacks: additional load on the CPU of
the flight computer and software safety/certification issues.


Any additional code that must be executed
within the inner control loop of a safety-critical system can
alter its runtime behavior. That may lead to violations of real-time
constraints, time overruns, or missed cycles. In particular,
on the small and weak flight computers that are often used on
drones or small spacecraft, this can lead to severe misbehavior
of the software and subsequent system crashes. There are
stories of small drones in a student project crashing during
flight because the students simply added some printf statements
to log additional variables (pers. comm.). Successful
monitoring must therefore be unobtrusive in the sense that it
should neither change the temporal behavior of the software nor
add to the CPU load.


Flight software, as mentioned earlier, is highly safety-critical.
Therefore, regulations require that this software
is certified according to prescribed, published standards.
For example, DO-178C (RTCA, 2012) defines software development
standards for safety-critical software in commercial
transport aircraft. Certification is a highly complex,
costly, and time-consuming process, and once the software has
passed, it may not be modified again. The addition
of run-time monitors to the software would therefore require a full
software re-certification, something that is, in most cases, out of
the question.


In this paper, we present R2U2/E, a realization of R2U2 that
combines high performance and low power requirements with an
extremely high level of unobtrusiveness. Previous R2U2 versions
(Geist, Rozier, & Schumann, 2014; Reinbacher, Rozier,
& Schumann, 2014; Schumann, Rozier, et al., 2015) have 
been developed on a dedicated FPGA chip in order to ad- 
dress the above-mentioned challenges. The new monitoring 
and reasoning engines of R2U2/E are executed on a modern 
EPIPHANY chip (Olofsson, Nordstrom, & Zain-ul-Abdin, 
2014), a high-performance, energy-efficient MIMD architec- 
ture with an efficient 2D mesh Network-on-Chip and a dis- 
tributed shared memory model. Designed as a powerful co- 
processor for numeric computation in real-time embedded 
systems, this architecture combines low power requirements 
with a high degree of parallel execution. 


We are using the EPIPHANY chip as a co-processor to a
Zynq 7000 Series SoC system, which runs a Linux operating
system with the flight software. A built-in frame buffer
architecture allows the EPIPHANY chip to access certain
locations of the main memory using direct memory
access (DMA).


R2U2/E is running in parallel on the EPIPHANY chip and 
fetches the values of variables to be monitored from the main 
processor’s memory. Using this novel architecture, we can 
monitor sensor variables and the flight software without hav- 
ing to instrument or modify the software. 


Our approach addresses the two main challenges for instrumentation
(Watterson & Heffernan, 2007): (1) probes must
be capable of observing enough relevant information to
determine the system’s state, and (2) the behavior of the observed
system must not be affected. In the application domain
of safety-critical systems (e.g., an autopilot), critical data are
usually placed in static (dedicated) memory locations. Hence,
we have a priori knowledge of, or can determine on system
startup, the memory locations of the data relevant for monitoring.
Moreover, the memory access by means of a dedicated
DMA channel, which is executed by a dedicated processing
unit, not only operates independently but also causes no
additional computational overhead on the observed system and
does not alter its behavior.
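

The idea of reading monitored variables from static, known memory locations can be sketched as follows. This is a plain-Python simulation, not the EPIPHANY DMA mechanism: a `bytearray` stands in for the memory region, and the variable names, offsets, and types in `LAYOUT` are hypothetical.

```python
import struct

# Hypothetical layout of monitored globals inside the memory region:
# name -> (byte offset, struct format), little-endian.
LAYOUT = {
    "altitude_m": (0, "<f"),
    "vertical_accel": (4, "<f"),
    "mode": (8, "<i"),
}


def read_variables(region, layout=LAYOUT):
    """Decode every monitored variable from `region` through a read-only view."""
    view = memoryview(region).toreadonly()  # the monitor never writes
    return {name: struct.unpack_from(fmt, view, offset)[0]
            for name, (offset, fmt) in layout.items()}


# Stand-in for the observed system writing its globals.
shared = bytearray(12)
struct.pack_into("<f", shared, 0, 3700.0)
struct.pack_into("<f", shared, 4, -3.5)
struct.pack_into("<i", shared, 8, 2)

snapshot = read_variables(shared)
```

The observed side only performs its ordinary writes; all knowledge about where the variables live is kept on the monitor side, which mirrors the role the static memory layout plays in the text above.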


The main contributions of this paper are: 


• development of an unobtrusive co-processor-based monitoring
architecture for R2U2/E. The flight software is
running on the main processor and can be monitored in
real-time without having to instrument it,


• development of a parallel execution architecture for
R2U2/E, which uses the available processors to execute
the various R2U2 components in parallel, and


• performance and power consumption analysis of
R2U2/E running on a Parallella board (Adapteva, 2017)
with a 16 and 64 core EPIPHANY co-processor.


The rest of this paper is structured as follows: In Section 2 we 
discuss related work. Section 3 will give a short overview of 
R2U2 and its implementation variants, as well as a short description
of the EPIPHANY multiprocessor and the Parallella
board. Section 4 discusses details of the parallel architecture 
and the unobtrusive access techniques used by R2U2/E. In 
Section 5, we show results of experiments on measuring the 
performance of R2U2/E on the Parallella board. Section 6 
concludes and discusses future work. 


2. RELATED WORK 


Tsai, Fang, Chen, and Bi (1990) and Watterson and Heffernan
(2007) discuss non-intrusive monitoring. A dedicated monitoring
system taps into the communication bus between the
processor and the memory. This requires that each memory
access (including unrelated ones) is processed immediately, which
forces the monitoring system to operate in an event-triggered
manner at a high rate. Reinbacher, Függer, and Brauer (2013), Heffernan,
Macnamee, and Fogarty (2014), and BusMOP (Pellizzoni,
Meredith, Caccamo, & Rosu, 2008; Meredith, Jin, Griffith,
Chen, & Rosu, 2012) follow a similar non-intrusive approach
to tap into a communication bus and use an FPGA implementation
in order to achieve the required performance. R2U2/E,
on the other hand, does not need to monitor the complete bus- 
traffic but can access an arbitrary memory location at a spe- 
cific time since it uses a dedicated DMA channel. Therefore, 
it can operate independently of the system under observation. 


Reinbacher, Brauer, Horauer, Steininger, and Kowalewski
(2014) and Reinbacher, Geist, Moosbrugger, Horauer, and
Steininger (2012) use a parallel FPGA design for creating
runtime monitors. Similar to R2U2/E, Berkovich, Bonakdar- 
pour, and Fischmeister (2015) use a dedicated multicore 
architecture for optimizing the performance of runtime mon- 
itors. Their monitors run on a dedicated GPU in order to 
minimize the overhead on the system under observation. 
They generate an instrumented C program from a specifi- 
cation, whereas R2U2/E follows a non-intrusive approach. 
Typical GPUs have a high performance, but with a power 
consumption that can be orders of magnitude higher than the 
EPIPHANY chip, making that approach less suitable for low 
power embedded applications. 


3. BACKGROUND 
3.1. R2U2 


R2U2 has been designed as a tool for continuous monitoring
and system/software health management. Properties can be 
specified in past-time and future-time Metric Temporal Logic 
(MTL) as well as “mission time” Linear Temporal Logic. 
Bayesian Networks can be used to perform probabilistic di- 
agnostic reasoning and root cause analysis. The R2U2 engine 
receives a vector of sensor signals and values of variables at 
a certain rate (Figure 1A). The atomic propositional check- 
ing (AT) unit performs various forms of signal processing, 
filtering, and component prognostics (Schumann, Roychoud- 
hury, & Kulkarni, 2015) before the values are discretized. 


The temporal logic processing units (TL) are implemented as
special-purpose processors and use advanced monitoring algorithms
based on storage queues (Reinbacher, Rozier, & Schumann,
2014), which require only a small memory footprint
and enable fast processing. Outputs of TL can be fed into
the Bayesian Network (BN) execution unit, which calculates 
posterior probabilities of components and failure modes. BN 
uses an efficient representation of the Bayesian Network as 
an Arithmetic Circuit (?, ?; Schumann, Rozier, et al., 2015). 


[Figure 1 graphic: three panels (A, B, C), each with monitored signals feeding the AT and BN blocks]


Figure 1. R2U2 Versions 


R2U2 has been developed in three main versions: the FPGA 
standalone version (Reinbacher, Rozier, & Schumann, 2014) 
(Figure 1A) was developed for high performance applications 
and resilience against tampering. Its AT, BN, and TL compo- 
nents communicate via an internal communication bus. Each 
of these building blocks can be instantiated as an arbitrary 
number of parallel copies, depending on the required performance
and available FPGA resources. The R2U2 FPGA
version has been used in several case studies, for example
on the NASA Swift UAS, where it was able to detect a
failing altimeter, a wrongly configured magnetometer, and pitch
oscillation induced by faults originating from the file system
(Schumann et al., 2013; Geist et al., 2014).


The hybrid version of R2U2 instantiates the AT and BN 
building blocks either as software components, or as part 
of the FPGA design (Figure 1B). To that end, we utilize a 
Xilinx Zynq SoC-FPGA chip which enables us to execute 
software on the embedded ARM CPU, instantiate the FPGA 
building blocks in the programmable logic section, and use 
chip-internal communication interfaces between these components.
This version benefits from the flexibility of software
development and allows us to steer the tradeoff between performance
and resource consumption. The hybrid version of
R2U2 has been used on the NASA Dragoneye UAS to monitor
and diagnose safety and security threats such as
GPS spoofing or malicious attacks through attack patterns
(Moosbrugger, Rozier, & Schumann, 2017). It also has been
used for on-board battery prognostics (Schumann, Roychoud- 
hury, & Kulkarni, 2015). 


The R2U2 software-only version (Figure 1C) has been devel- 
oped to run exclusively as a software component, opening up 
new application areas. This R2U2 variant has been instantiated
as a software “app” of the NASA Autonomy Operating System
(AOS) (Lowry, Rayadurgam, Schumann, Pressburger, &
Dalal, 2017) to provide monitoring and diagnostic capabil- 
ities for autonomous UAS operations. Our different R2U2 
versions use the same modeling tool-chain in order to facili- 
tate model interchange and reuse. 


3.2. EPIPHANY and Parallella 


The Parallella board is a credit-card sized high performance 
computer featuring a dual-core ARM A9 processor and a 16 
or 64 core EPIPHANY co-processor. This co-processor con- 
sists of a scalable array of simple RISC processors with a 
fast floating-point arithmetic unit. This MIMD (Multiple In- 
struction, Multiple Data) mesh of independent cores is con- 
nected together with a fast on-chip network within a dis- 
tributed shared memory architecture (Olofsson et al., 2014). 
A low-level library provides access mechanisms to the on- 
chip distributed memory and the memory of the ARM pro- 
cessor, Mutexes and other synchronization mechanisms, as 
well as utility functions to load and control each of the cores. 
Each core has two dedicated DMA channels for high-speed 
data transfer between the different on- and off-chip memory 
regions. 


According to the EPIPHANY datasheet (Adapteva, 2014), the
processor is tailored to low-power applications and is capable
of achieving 102 GFLOPS peak performance while consuming
less than 2 Watts.


4. PARALLEL ARCHITECTURE OF R2U2/E 


The R2U2/E architecture has to meet two important goals: 
performance gain of the R2U2 monitoring and reasoning 
components by exploiting the multicore architecture of the 
EPIPHANY chip, and a highly unobtrusive access to the vari- 
ables of the system under observation (SuO) running on the 
main processor of the Parallella board. 


4.1. Architecture 


[Figure 2 graphic: ARM CPU running the SuO; EPIPHANY running R2U2/E with its AT, TL, and BN components]


Figure 2. Architecture of R2U2/E on Parallella 


Figure 2 shows an overview of the architecture of R2U2/E. 
The system under observation (SuO), e.g., the autopilot software,
is executed on the ARM CPU of the Parallella board as
a Linux process and can communicate with external subsys- 
tems and sensors (not shown here). The SuO is configured 
in such a way, that global variables of interest are located 
in a specific region of the memory, which can be accessed 
by R2U2 running on the EPIPHANY chip. A hardware- 
based DMA (direct memory access) mechanism allows the 
EPIPHANY hardware interface to copy data from specific 
parts of the DRAM into the local memory of the EPIPHANY 
chip. This low-level access and synchronization is imple- 
mented in hardware in such a way that the Linux process is 
oblivious to this access and no process- or context switches 
or interrupts need to occur. With this architecture, the SuO 
need not be instrumented for monitoring, and its execution is 
not unduly affected by the monitoring. 


R2U2/E itself resides on the cores of the EPIPHANY chip.
An additional program, running as a Linux process on the
ARM CPU, is in charge of initializing the cores, coordinating
the data transfer, controlling the time stamps, and starting
mitigation actions. For clarity, Figure 2 does not show this
component. The R2U2/E master on one of the cores is in charge
of controlling the data transfer and setting up the worker
cores that contain distributed versions of the R2U2 components. In
the following, we will present the detailed architecture and 
data flow on the EPIPHANY chip as well as the access mech- 
anisms to the memory of the ARM CPU. 


4.2. Parallel Execution 


The execution model of R2U2/E attempts to obtain speed- 
ups with a two-level parallelization scheme: (a) execute the 
R2U2 components (AT, BN, TL) in parallel to the master M 
that is handling the data transfer, and (b) execute the AT and 
TL component in parallel by splitting up the R2U2 model. 
Each of the R2U2 components can be seen as a functional block
that takes an input vector $I_t$ at time $t$ and produces an output
vector $O_{t+1}$. Signal values $S_t$ comprise the input to AT.


In our model, we run all worker processes in a synchronous 
loop, as shown in Algorithm 1, synchronized by barriers. By 
using two sets of input and output vectors, the individual com- 
ponents of R2U2/E TL can be executed in parallel on differ- 
ent cores. Because the workers need to have access to the 
previous values of inputs and outputs, we use buffers $B^I$ and
$B^O$ that point to the respective vectors in order to avoid unnecessary
copies. These buffers and the vectors are located in
the EPIPHANY RAM that can be accessed by all cores. 
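

The lock-step execution with two buffer slots can be sketched in a few lines. This is a simplified illustration and not Algorithm 1 itself: Python threads stand in for the EPIPHANY cores, a list assignment stands in for the DMA transfer, only one AT-like worker is shown, and the threshold filter and sample values are hypothetical.

```python
import threading


def run_demo(samples):
    """Run a master and one AT-like worker in lock-step over two buffer slots."""
    buf_s = [None, None]              # two signal vectors (S_1, S_2)
    outputs = []                      # (time stamp, discretized vector) pairs
    barrier = threading.Barrier(2)    # master + one worker

    def master():
        for t, sample in enumerate(samples):
            buf_s[t % 2] = sample     # stands in for the DMA transfer
            barrier.wait()            # end of update step t

    def at_worker():
        for t in range(len(samples)):
            prev = buf_s[1 - t % 2]   # slot that was filled in the previous step
            if prev is not None:
                outputs.append((t, [x > 1.0 for x in prev]))
            barrier.wait()

    threads = [threading.Thread(target=master),
               threading.Thread(target=at_worker)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return outputs


outputs = run_demo([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

In every step the master writes one slot while the worker reads the other, so no lock is needed beyond the barrier at the end of the step; the price is that results lag the input by one time stamp.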


For the inner level of parallelism, we employ the fact that an 
R2U2 model is highly modular: a model consists of a num- 
ber of filters that work on the input signals, which can be 
Booleans, integers, floating point numbers, but also strings or 
complex data structures. Each filter produces, for each time 
stamp $t$, a single element of the input vector $I$. Therefore,
we can, without any additional need for synchronization, run 


Algorithm 1 R2U2/E. Initially: v = true, t = 0

B^I ← [I_1, I_2]; B^O ← [O_1, O_2]; B^S ← [S_1, S_2]
while true do
  in parallel {
    M:
      if v then
        swap(B^I_1, B^I_2); B^S_1 ← DMA;
        R2U2 ← B^O_1; swap(B^O_1, B^O_2);
      else
        swap(B^I_2, B^I_1); B^S_2 ← DMA;
        R2U2 ← B^O_2; swap(B^O_2, B^O_1);
      end if
    AT:
      if v then
        B^I_1 ← AT(B^S_2);
      else
        B^I_2 ← AT(B^S_1);
      end if
    TL:
      if v then
        B^O_1 ← TL(B^I_2, B^I_1, B^O_2)
      else
        B^O_2 ← TL(B^I_1, B^I_2, B^O_1)
      end if
    BN:
      if v then
        B^O_1 ← BN(B^I_1)
      else
        B^O_2 ← BN(B^I_2)
      end if
  } barrier
  t ← t + 1; v ← ¬v
end while


different AT filters on different cores. In a similar way, the 
R2U2 model consists of numerous individual temporal formulas
that can be executed independently on different processors.
Figure 3 shows the situation, where the TL engines
have access to all elements of the current input vector $B^I$, but
only write to disjoint subsets of the output vector $B^O$.


[Figure 3 graphic: AT input vector and TL results, both held in shared RAM]

Figure 3. Memory access for R2U2/E cores. Input and output
vectors use shared memory on the EPIPHANY chip


Here, our parallel execution model builds upon the memory-mesh
architecture of EPIPHANY, which allows multiple
cores to read the same memory cell. Unprotected writes are
only possible by one processor. The mapping of the elements
of the R2U2 model will be described below. If AT must
access results of TL or BN, for example to discretize a
posterior probability, those results are fed back to the input
vector and will be processed at the next time stamp, as is done
in all R2U2 implementations.


4.3. Load Balancing 


The maximum update rate $R_{max}$ of R2U2/E is governed by
the maximum of the execution times that the master and each of
the workers need for one update step. It is therefore essential
to break down the individual components of the R2U2 model
into subsets that exhibit roughly the same execution time.


Each AT filter, the execution of a Bayesian network, and
the temporal logic operators have bounded, statically determinable
execution times (for proofs see Reinbacher, Rozier,
& Schumann, 2014). We are currently using a simple static
allocation method that is based upon averages of measured
execution times for the individual formula components. The
actual time-stamp rate $R_{R2U2}$ must be set to a value smaller
than $R_{max}$ in order to avoid timing skews.
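

A static allocation based on measured average execution times could, for instance, use a greedy longest-first heuristic. The sketch below is an assumption for illustration, not the allocation method actually used in R2U2/E; the formula names and costs are hypothetical.

```python
import heapq


def allocate(costs, n_workers):
    """Greedy longest-first assignment of formulas to workers.

    `costs` maps a formula name to its measured mean execution time.
    Returns (assignment, loads), both keyed by worker index.
    """
    heap = [(0.0, w) for w in range(n_workers)]   # (accumulated time, worker)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    # Place the most expensive formulas first, always on the least-loaded worker.
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(name)
        heapq.heappush(heap, (load + cost, w))
    loads = {w: sum(costs[n] for n in names) for w, names in assignment.items()}
    return assignment, loads


# Hypothetical per-formula times in microseconds.
costs = {"f1": 8.0, "f2": 7.0, "f3": 6.0, "f4": 5.0, "f5": 4.0}
assignment, loads = allocate(costs, 2)
step_time = max(loads.values())   # the slowest worker bounds the update step
```

Because all workers synchronize at a barrier, the largest per-worker load bounds the step time, and the maximum update rate is at most its reciprocal. The greedy heuristic is not guaranteed to be optimal: here it yields loads of 17 and 13, whereas a 15/15 split exists.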




Figure 4. Architecture of R2U2/E and SuO 


4.4. Access to Software under Observation 


The software to be monitored by R2U2, the software under
observation (SuO), is executed as a regular process on the
Linux system; global variables that are of interest for R2U2
are located in the system’s RAM. Ideally, R2U2 only
needs to know the physical memory addresses of those
global variables and could then pull their values using a direct
memory access (DMA) channel. This unprotected read-only
access is possible on the Parallella board. However, it would
require low-level code modifications.


For this paper, we therefore use a different approach (Fig- 
ure 4): the Parallella board has a built-in frame buffer, that 
is usually used to drive the video display. Technically, this 
is nothing more than a region of shared memory that can be 
accessed by other processes or the EPIPHANY cores. Since 
R2U2 is only doing read accesses to this memory, no additional
access protection is needed. The loader of the SuO is
directed to place all global variables into the frame buffer. No 
instrumentation or other modification of the code is required. 


The R2U2 process on the host is in charge of pulling, at regular
intervals, the global variables from the frame buffer into a
shared buffer that can be accessed by the EPIPHANY cores.
Its main loop then notifies the R2U2 master on the
EPIPHANY that new data are available and that processing 
by the EPIPHANY worker cores should start. After paral- 
lel execution, results are read back into the host memory and 
logged or used to trigger mitigation actions. Figure 4 shows 
the various components of R2U2 and the interaction between 
their memory regions. 


5. EXPERIMENTS AND RESULTS 


In this section, we present results on experiments to evalu- 
ate the performance of the R2U2/E architecture. We have 
instantiated this architecture for a 16 core EPIPHANY chip 
mounted on a Parallella board. The SuO and the host R2U2 
components are running as individual Linux processes. For
this paper, we restrict ourselves to analyzing only the run-time
behavior of the temporal processing unit (TL).


In a first experiment, we evaluated the overhead the SuO has
to face when it is writing variables into the frame buffer. For
an inner-loop access, the time to write a variable was about
7.5 ns and did not change noticeably from the time needed
to write to the non-shared global memory. The time also did
not change measurably, when R2U2 was accessing the shared 
frame buffer. This indicates that monitoring via R2U2/E can 
be done in a very unobtrusive manner. 


We then determined the execution times for a single update
of the R2U2 master running on the EPIPHANY chip. Table 1
shows that it takes considerable time for the host component
to package the data and start the master, which in turn fetches
the data and sends back results. As expected, most of the
time is spent copying the input and result data between the
processors. A single update cycle of the R2U2 master takes
about 13 µs without data transfer. We also observed that a
high rate at which the SuO writes into the frame buffer
can slow down the R2U2/E cycle rate. Note that here, we
are measuring means of minimal execution times for R2U2
operations. In a regular application, R2U2 would be executed
with a fixed update rate R, which must be smaller than our
experimentally obtained maximum update rate (cf. Section 4.3).


In order to obtain execution times for each TL operator, we 
use typical formulas from prior case studies, (Moosbrugger 
et al., 2017; Schumann, Moosbrugger, & Rozier, 2016; Geist 
et al., 2014; Schumann, Roychoudhury, & Kulkarni, 2015) or 
artificial formulas running on regular input traces. 


Table 1 shows mean execution times for Boolean operators
($\lor$, $\land$, $\neg$, $\rightarrow$), the $\odot$ (previously)
operator, as well as MTL operators with time points
(e.g., $\Box_{[10]}\,p$) and intervals (e.g., $\Box_{[5,10]}\,p$).
As a comparison, we measured the run-times of
the software version of R2U2 as called from the Octave system.
Here again, we ignore the interface overhead to Octave.
In sequential mode, a single operator takes about twice the
time on the EPIPHANY chip. However, as soon as more than
2 workers are operating at full speed, R2U2/E will be faster
than R2U2 running on the ARM CPU. Theoretically, a maximum
speed-up of about 7 over the ARM execution should
be achievable with a 16 core EPIPHANY chip. On a 64 core
chip, a considerably higher speed-up of about 30 should be
possible.
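

One plausible way to arrive at these theoretical figures is the following back-of-the-envelope estimate. It assumes that one core is reserved for the master (leaving $N-1$ workers), that each worker needs about twice the ARM time per operator, that load balancing is perfect, and that synchronization and transfer cost nothing; the symbols $S_N$, $t_E$, and $t_A$ are introduced here for illustration only.

$$
S_N \approx \frac{N-1}{t_E / t_A}, \qquad
S_{16} \approx \frac{15}{2} = 7.5, \qquad
S_{64} \approx \frac{63}{2} = 31.5
$$

Both values are in line with the speed-ups of about 7 and about 30 quoted above.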


Operation                        | t [µs] (†) | t [µs] (*)
Master (M)                       | 13.1       | NA
M with transfer                  | 28.2       | NA
M with high bandwidth transfer   | 38.4       | NA
Boolean                          | 0.23       | 0.12
Temporal (time point)            | 0.4        | 0.2
Temporal (interval)              | 0.8        | 0.45


Table 1. Basic performance for R2U2/E (†) and R2U2 running
on ARM A9 (*)


[Figure 5 graphic. Panel A: execution time over number of instructions (0 to 300) for single-core and 16-core runs with 0 and 5 TL operators. Panel B: execution time over number of temporal operators for program sizes 255 and 73.]


Figure 5. Performance of R2U2/E 


The overall calculation time for our models depends on the 
number as well as the types of TL operators in the formula. 
Figure 5A shows the mean execution time of two different 
formulas being executed. The number of operators in the formula
corresponds to the number of instructions of the TL processor.
Each of them has been executed on a single core as
well as on our 16 core implementation. Times have been measured
without data transfer. The red and the magenta traces
are formulas that contain 5 temporal logic operators with timing
constraints, whereas the blue and the green traces
contain only temporal logic operators without intervals.


As Figure 5A shows, the execution time grows linearly with
the size of the formula, as long as the type of the TL operators
does not change. Figure 5B indicates how an increasing number
of temporal operators leads to slightly increased (sequential)
execution times. The relatively low number of temporal
operators compared to Boolean operators is indicative of most 
of our R2U2 models. 


Figure 5A also shows how the non-trivial time consumed for
synchronization and data transfer influences the overall system
behavior. Even with a formula length of 255 operators
(currently the maximum for this configuration) and optimal
load balancing, the execution time on 16 cores is still dominated
by the synchronization overhead of 13.1 µs.


Figure 6. Speed-ups for R2U2 model with 198 Boolean and
36 temporal operators 


The substantial data transfer and synchronization overhead 
reduces the achievable speed-up for smaller R2U2 models. 
Figure 6 shows the speed-up curves for an existing R2U2 
model. Brown and blue lines show the measured speed-up 
values against running R2U2/E with one TL core for a ran- 
dom mapping of formulas to the cores (blue) and for a near- 
optimal mapping (brown). The execution times of the host 
and master component of R2U2/E limit the maximal speed-up
to approximately 3 for this formula (red line in Figure 6),
which is achieved when a larger number of cores are active.
The dashed blue line corresponds to the maximal speed-up
ignoring communication and synchronization overhead. It
shows that for this R2U2 formula only 3 cores could be kept
busy in a meaningful manner. This is an indication that the
R2U2/E architecture is most beneficial for large and complex 
sets of temporal formulas. 
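

The effect of a fixed per-step overhead on the achievable speed-up can be illustrated with a rough model. This is a sketch under assumptions, not the analysis behind Figure 6: it takes the per-operator times and the "M with transfer" time from Table 1, the operator counts from the Figure 6 caption, treats all 36 temporal operators as time-point operators, and assumes perfect load balancing.

```python
def speedup(work_us, overhead_us, n_workers):
    """Speed-up over one worker when `overhead_us` is paid in every step."""
    return (overhead_us + work_us) / (overhead_us + work_us / n_workers)


# Sequential TL work per update step in microseconds (assumed operator mix).
work = 198 * 0.23 + 36 * 0.4
overhead = 28.2                               # master with transfer, Table 1

limit = (overhead + work) / overhead          # bound for arbitrarily many workers
with_15_workers = speedup(work, overhead, 15)
```

Under these assumptions the bound evaluates to roughly 3.1 and the value for 15 workers to roughly 2.7, which lands near the limit of approximately 3 reported above; a larger model increases `work` relative to `overhead` and raises the bound accordingly.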


For the R2U2/E implementation, we made initial measurements
of the overall power consumption. Execution of the TL
engine on the ARM CPU (ignoring all data transfer) uses
approximately 0.2 W. 15 workers on the EPIPHANY chip take
0.35 W of power (again ignoring all data transfer), i.e., roughly
0.023 W per worker, yielding a reduction in power usage per
TL engine by a factor of approximately 8.
The overall R2U2/E architecture on the host side and on the
EPIPHANY chip uses approximately 1.5 W. Since we could
not measure the power consumption of the individual components
of the Parallella board separately, these numbers should
be regarded with caution.


6. CONCLUSIONS 


In this paper, we presented R2U2/E, a parallel and unobtru- 
sive architecture of the R2U2 runtime monitoring system, 
which is running on a parallel EPIPHANY chip with 16 or 
64 cores. The execution of R2U2 on a co-processor relieves 
the flight computer from computational burdens for monitoring
and makes R2U2 substantially more unobtrusive. With a
DMA-based architecture for the access of variables of interest,
the flight software does not need to be instrumented or
otherwise modified, and R2U2/E can access these data without
changing the behavior of the flight software in a noticeable manner.


On the EPIPHANY chip, a master processor handles synchro- 
nization and data transport and is in charge of controlling all 
workers that work in lock-step on the given input data. Independent
subformulas can be executed in parallel on different
processors, providing good potential for high speed-ups for
most R2U2 models.


Although R2U2/E can achieve timestamp rates for the tem- 
poral logic engine of about 20kHz, there is much room for 
improvement of this architecture. In particular, the transfer 
of the data from SuO via the host-side R2U2 currently is the 
main bottleneck prohibiting higher update rates. Direct mem- 
ory access of the R2U2 master on the shared buffer should 
solve this issue. Furthermore, R2U2/E on the EPIPHANY 
will be extended to handle control and should only notify the 
host-side of R2U2 in case properties are violated, which is 
expected to happen only at a low rate. 


The current implementation of R2U2/E replicates the formu- 
las over all cores, thus wasting precious local memory. Small 
changes to the R2U2 modeling and compilation tool chain 
will allow us to monitor more and larger properties, taking 
advantage of the inherent speedup. Finally, we are planning 
to investigate how to specify R2U2 models, which contain 
components with bounded, but high computational require- 
ments, e.g., large Bayesian networks or model-based particle 
filters and how to efficiently map them to our R2U2/E archi- 
tecture. 